# Image-Text Matching

| Model | Author | License | Tags | Downloads | Likes | Description |
|---|---|---|---|---|---|---|
| Sail Clip Hendrix 10epochs | cringgaard | – | Text-to-Image, Transformers | 49 | 0 | A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs. |
| Video Llava | AnasMohamed | – | Text-to-Image | 194 | 0 | A large-scale vision-language model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text. |
| Vilt Finetuned 200 | Atul8827 | Apache-2.0 | Text-to-Image, Transformers | 35 | 0 | A vision-language model based on the ViLT architecture, fine-tuned for specific tasks. |
| Clip Vit Large Patch14 | Xenova | – | Text-to-Image, Transformers | 17.41k | 0 | OpenAI's open-source CLIP model, based on the Vision Transformer (ViT) architecture, supporting joint understanding of images and text. |
| CLIP Giga Config Fixed | Geonmo | MIT | Text-to-Image, Transformers | 109 | 1 | A large CLIP model trained on the LAION-2B dataset using the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text. |
| Japanese Cloob Vit B 16 | rinna | Apache-2.0 | Text-to-Image, Transformers, Japanese | 229.51k | 12 | A Japanese CLOOB (Contrastive Leave-One-Out Boost) model trained by rinna Co., Ltd. for cross-modal understanding of images and text. |
| Clip Vit Large Patch14 336 | openai | – | Text-to-Image, Transformers | 5.9M | 241 | A large-scale vision-language pretrained model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text. |
| Distilbert Base Turkish Cased Clip | mys | – | Text-to-Image, Transformers | 2,354 | 1 | A Turkish text encoder fine-tuned from dbmdz/distilbert-base-turkish-cased, designed to work with CLIP's ViT-B/32 image encoder. |
| Clip Vit B 32 Japanese V1 | sonoisa | – | Text-to-Image, Transformers, Japanese | 690 | 21 | A Japanese CLIP text/image encoder distilled from the English CLIP model. |
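
All of the CLIP-family models above score image-text matches the same way: each modality is embedded, the embeddings are L2-normalized, and cosine similarity (scaled by a temperature) is turned into a distribution over candidate captions. The sketch below illustrates that scoring step with random stand-in embeddings; in a real pipeline the vectors would come from one of the listed encoders (e.g. openai/clip-vit-large-patch14), and the `match_scores` helper, dimensions, and temperature value here are illustrative assumptions, not any specific model's API.

```python
import numpy as np

# Stand-in embeddings: in practice these would be produced by a CLIP image
# encoder and text encoder; sizes here are toy values for illustration.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(2, 8))  # 2 images, 8-dim embeddings
text_emb = rng.normal(size=(3, 8))   # 3 candidate captions

def match_scores(image_emb, text_emb, temperature=0.07):
    """CLIP-style matching: L2-normalize both sides, take cosine
    similarity scaled by a temperature, then softmax over the
    candidate texts for each image."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # shape (n_images, n_texts)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

probs = match_scores(image_emb, text_emb)
print(probs.shape)  # (2, 3): one distribution over captions per image
```

For retrieval in the other direction (text-to-image, as the tags above suggest), the same similarity matrix is simply softmaxed over images instead of texts.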